Analyzing the lein-topology function dependency network

Automating recommendations to improve the software architecture


In [1]:
from py2cytoscape.data.cynetwork import CyNetwork
from py2cytoscape.data.cyrest_client import CyRestClient
from py2cytoscape.data.style import StyleUtil

import sand

import matplotlib.pyplot as plt

%matplotlib notebook

graph.from_edges with a list of dictionaries

Use graph.from_edges with an adjacency list consisting of two vertex names and an edge weight represented as a List of Dictionaries.


In [2]:
network_name = "5b2a6e3"
network_collection = "lein-topology"

data_path = "./data/" + network_collection + "-" + network_name
edgelist_file = data_path + ".csv"

edgelist_data = sand.csv_to_dicts(edgelist_file,header=['source', 'target', 'weight'])
edgelist_data[:5]


Out[2]:
[OrderedDict([('source', 'topology.dependencies/dependencies'),
              ('target', 'clojure.core/defn-'),
              ('weight', '1')]),
 OrderedDict([('source',
               'topology.edgelist-test/syntax-quotes-add-seq-concat-list'),
              ('target', 'clojure.core/filter'),
              ('weight', '1')]),
 OrderedDict([('source',
               'topology.dependencies-test/should-compute-fn-calls-in-namespace'),
              ('target', 'clojure.core/defn'),
              ('weight', '1')]),
 OrderedDict([('source', 'example/test-when'),
              ('target', 'clojure.core/cons'),
              ('weight', '1')]),
 OrderedDict([('source', 'leiningen.topology/topology'),
              ('target', 'org.clojure/clojure'),
              ('weight', '1')])]

In [3]:
g = sand.from_edges(edgelist_data)
g.summary()


Out[3]:
'IGRAPH DNW- 107 206 -- \n+ attr: group (v), indegree (v), label (v), name (v), outdegree (v), weight (e)'

Is the graph simple?

A graph is simple if it does not have multiple edges between vertices and has no loops, i.e. an edge with the same source and target vertex.

A graph that isn't simple can cause problems for some network analytics algorithms.


In [4]:
g.is_simple()


Out[4]:
True

Initial Clustering based on Namespace

Groups represent modules or communities in the network. Groups are based on the labels by default.


In [5]:
g.vs['group'][:5]


Out[5]:
[96, 19, 71, 26, 87]

The vertices in the lein topology data set contain fully-qualified namespaces for functions. Grouping by name isn't particularly useful here:


In [6]:
len(set(g.vs['group']))


Out[6]:
107

In [7]:
len(g.vs)


Out[7]:
107

Because sandbook was build specifically for analyzing software and system networks, a fqn_to_groups grouping function is built in:


In [8]:
g.vs['group'] = sand.fqn_to_groups(g.vs['label'])

In [9]:
len(set(g.vs['group']))


Out[9]:
20

This is a much more managable number of groups.

Namespaces may also be useful as a separate attribute on the vertices:


In [10]:
g.vs['namespace'] = sand.namespaces(g.vs['label'])

Use degree centrality to identify candidates to filter from the analysis

Outdegree represents the vertices that have the most dependencies, i.e. call the most number of functions. These functions could potentially be split into smaller, more cohesive functions.

Indegree represents the vertices are depended on the most...changing them will have the most impact on other parts of the system.


In [13]:
from sand.degree import degree_distribution_count

degree, count = degree_distribution_count(g)

In [14]:
plt.title("Degree Distribution")
plt.ylabel("count")
plt.xlabel("degree")
infig, = plt.loglog(degree, count, 'b-', marker='o')



In [15]:
g.vs['outdegree'][:5]


Out[15]:
[12, 0, 15, 0, 13]

In [16]:
g.vs['indegree'][:5]


Out[16]:
[1, 12, 0, 7, 0]

Which vertices have a degree of more than some majority percentage of the maxdegree?


In [17]:
g.maxdegree(mode='IN')


Out[17]:
13

In [18]:
g.maxdegree(mode='OUT')


Out[18]:
15

In [19]:
score = g.maxdegree() * .80

These functions call the highest number of others and could potentially be split:


In [20]:
[v['name'] for v in g.vs.select(lambda vertex: vertex['outdegree'] >= 10)]


Out[20]:
['topology.dependencies/dependencies',
 'topology.edgelist-test/syntax-quotes-add-seq-concat-list',
 'topology.dependencies-test/should-compute-fn-calls-in-namespace',
 'leiningen.topology/topology',
 'topology.symbols/seq-map-zip',
 'topology.dependencies/sources',
 'topology.qualifier/fq-ns']

Changing these functions will have the most impact on other parts of the system:


In [21]:
[v['name'] for v in g.vs.select(lambda vertex: vertex['indegree'] >= 4)]


Out[21]:
['clojure.core/defn-',
 'clojure.core/filter',
 'clojure.core/defn',
 'clojure.core/seq',
 'clojure.core/fn',
 'clojure.core/meta',
 'clojure.core/=',
 'clojure.test/is',
 'clojure.core/symbol',
 'clojure.test/deftest',
 'clojure.core/->>',
 'clojure.core/range',
 'clojure.core/map',
 'clojure.core/list']

In this case, clojure.core and clojure.test namespaces have the most dependencies...unsurprising, given these are the foundational libraries of the language! These are good candidates to filter out of a visualization, since they often don't add deep insight to the aspects of the design specific to the program.

Extract the subgraph of local namespaces based on our filter criteria

There are some analyses where it will be useful to see all the vertices. For the high-level architecture diagram, we can focus on the functions local to the library's namespaces. We'll also keep functions that have side-effects to see if these are isolated to only a few key parts of the program:


In [22]:
# List all patterns of vertex names that we want to keep:
names_to_keep = ('topology', 'clojure.core/*err*', 'clojure.core/println', 
                 'clojure.zip', 'clojure.java.io', 'clojure.tools.namespace',
                 'leiningen.core.eval', 'clojure.repl')

lv = g.vs(lambda v: any(match in v['label'] for match in names_to_keep))
lg = g.subgraph(lv)

# Recompute degree after building the local subgraph (lg):
lg.vs['indegree'] = lg.degree(mode="in")
lg.vs['outdegree'] = lg.degree(mode="out")

lg.summary()


Out[22]:
'IGRAPH DNW- 35 35 -- \n+ attr: group (v), indegree (v), label (v), name (v), namespace (v), outdegree (v), weight (e)'

Visualizing the network in Cytoscape

Verify that Cytoscape is running and get the current version


In [24]:
import sand.cytoscape as sc

sc.print_version()


{
  "apiVersion": "v1",
  "cytoscapeVersion": "3.7.0"
}

Load the network into Cytoscape with a default layout


In [25]:
# Create py2cytoscape client
cy = CyRestClient()

In [26]:
# Optional: delete existing Cytoscape sessions.
cy.session.delete()

In [27]:
# Load the network
network = cy.network.create_from_igraph(lg, name=network_name, collection=network_collection)

Layout


In [28]:
# Apply default layout
cy.layout.apply(name='force-directed', network=network)

Customize the style

Use one of the included themes, or build your own.


In [30]:
style = cy.style.create('Ops')
style.update_defaults(sc.ops)

# Map the label property in the igraph data to Cytoscape's NODE_LABEL visual property
style.create_passthrough_mapping(column='label', vp='NODE_LABEL', col_type='String')

Color vertices by namespace:


In [32]:
from sand.cytoscape import colors


border_colors = {
  'topology.finder': colors.BRIGHT_YELLOW,
  'topology.dependencies': colors.BRIGHT_ORANGE,
  'topology.dependencies-test': colors.BRIGHT_ORANGE,
  'topology.qualifier': colors.BRIGHT_PURPLE,
  'topology.symbols': colors.BRIGHT_BLUE,
  'clojure.core': colors.BRIGHT_RED,
  'clojure.java.io': colors.BRIGHT_RED,
  'topology.printer': colors.BRIGHT_RED,
  'leiningen.topology': colors.BRIGHT_WHITE,
}

fill_colors = {
  'topology.finder': colors.DARK_YELLOW,
  'topology.dependencies': colors.DARK_ORANGE,
  'topology.dependencies-test': colors.DARK_ORANGE,
  'topology.qualifier': colors.DARK_PURPLE,
  'topology.symbols': colors.DARK_BLUE,
  'clojure.core': colors.DARK_RED,
  'clojure.java.io': colors.DARK_RED,
  'topology.printer': colors.DARK_RED,
  'leiningen.topology': colors.DARK_WHITE,
}

style.create_discrete_mapping(column='namespace', col_type='String', vp='NODE_FILL_COLOR', mappings=fill_colors)
style.create_discrete_mapping(column='namespace', col_type='String', vp='NODE_BORDER_PAINT', mappings=border_colors)

In [33]:
cy.style.apply(style, network)

Load layout coordinates from a previous session


In [34]:
positions_file = data_path + "-positions.csv"
sc.layout_from_positions_csv(network, positions_file, cy)

Save the updated layout coordinates if you make changes

A benefit of this workflow is the ability to manually tweak the algorithmic network layout in Cytoscape.

After making changes, save the coordinates for a later session:


In [35]:
sc.positions_to_csv(network=network, path=positions_file)

Generate an SVG export

Position the network in Cytoscape the way you want it, then trigger this export. When iterating, run all cells above, then all cells below this point to avoid race conditions with cytoscape's renderer.


In [36]:
# Hide all panels in the UI
sc.hide_panels()


Out[36]:
True

In [37]:
# Fit to the window:
cy.layout.fit(network=network)

In [38]:
view_id = network.get_views()[0]
view = network.get_view(view_id=view_id, format='view')

In [39]:
# Zoom out slightly:
view.update_network_view('NETWORK_SCALE_FACTOR', 0.65)

In [40]:
# Shift the network to the left:
view.update_network_view('NETWORK_CENTER_X_LOCATION', 150.0)

In [41]:
from IPython.display import SVG, display

svg_data = network.get_svg()
display(SVG(svg_data))


Creator: FreeHEP Graphics2D Driver Producer: org.freehep.graphicsio.svg.SVGGraphics2D Revision Source: Date: Thursday, March 7, 2019 11:44:01 AM CST topology.edgelist-test/should-report-java-interop topology.edgelist-test/should-produce-weighted-edges-for-multiple-calls clojure.tools.namespace.file/read-file-ns-decl topology.finder/dirs->sources topology.finder/source-paths->namespaces topology.finder/find-sources-in-dir topology.edgelist-test/syntax-quotes-add-seq-concat-list clojure.zip/zipper clojure.zip/end? clojure.zip/next clojure.zip/node topology.symbols/seq-map-zip topology.dependencies-test/should-compute-fn-calls-in-namespace topology.qualifier/filter-fully-qualified topology.qualifier/fq-ns clojure.core/println clojure.core/*err* topology.printer/print-weighted-edges topology.qualifier/all-fq topology.symbols/symbols topology.finder/clojurescript-file? topology.dependencies/sources topology.edgelist-test/edges clojure.java.io/file clojure.tools.namespace.file/clojure-file? topology.symbols/zip-nodes clojure.repl/source-fn topology.dependencies/interns topology.dependencies/dependencies topology.dependencies/ns->fn-dep-map topology.edgelist/interleave-keys topology.edgelist/ns->edgelist topology.edgelist/dirs->fn-edges leiningen.core.eval/eval-in-project leiningen.topology/topology

In [50]:
# Write the svg to a file if everything looks good:
with open(data_path + '.svg', 'wb') as f:
    f.write(svg_data)

To create an updated structure after making new commits:

  • Generate an updated network.
  • Copy the previous position file to use as a starting point for the next visualization.
  • Open Cytoscape or destroy existing collections if Cytoscape is already running.
  • Run all cells to load the visualization.
  • Save the new layout if you make changes to node positions.

Visualizing the network with a Dependency Structure Matrix (DSM)

Each row shows a function's dependencies.

Each column shows callers impacted by the function.


In [42]:
from bokeh.plotting import show, output_notebook
from bokeh.palettes import all_palettes

output_notebook()


Loading BokehJS ...

Create a color palette

We want to choose a color palette so that groups stand out.

Find the number of groups that you have assigned to your vertices:


In [43]:
num_groups = max(set(lg.vs['group']))
num_groups


Out[43]:
19

Now we need to choose an appropriate palette that accomodates this number of groups and achieves the desired visual separation.

As a general rule, you'll need one of the large palettes for > 20 groups.

Pass the name of the palette you choose to the all_palettes function:


In [44]:
palette = all_palettes['Category20'][num_groups + 1]

Determine the order to sort the vertex labels

The matrix visualization will be rendered according to a vertex attribute used as the sort_by parameter in matrix.


In [45]:
lg.vs.attributes()


Out[45]:
['name', 'indegree', 'outdegree', 'label', 'group', 'namespace']

Render the matrix


In [46]:
p = sand.matrix(lg, 'indegree', "{}-{} by indegree".format(network_collection, network_name), 900, palette)
show(p)



In [56]:
p = matrix(lg, 'outdegree', "{}-{} by outdegree".format(network_collection, network_name), 900, palette)
show(p)



In [47]:
p = sand.matrix(lg, 'group', "{}-{} by group".format(network_collection, network_name), 900, palette)
show(p)


Organizing the system by scoring coupling and cohesion

Intuition

Ordering by group / modules gives us a visual indication of how well the system accomplishes the design goal of loosely coupled and highly cohesive modules. We can quantify this idea.

Clustering is a type of assignment problem seeking the optimal allocation of N components to M clusters. One of the prominent heuristics of system architecting is to choose modules such that they are as independent as possible...low coupling and high cohesion.

We can objectively score these clustering algorithms using an objective function that considers both the size of the clusters ($C_i$) and the number of interactions outside the clusters ($I_0$) according to the following equation, where $\alpha = 10$, $\beta = 100$ or $\alpha = 1$, $\beta = 10$, and $M$ is the number of clusters:

$Obj = \alpha \sum_{i=1}^{M}C_i^2 + \beta I_0$

See page 25 of Design Structure Matrix Methods and Applications for more information.

Clustering objectives work against two competing forces:

  • We want to minimize the size of the largest modules...otherwise, we could just take the trivial result of putting everything into one module. M=1
  • We want to minimize the number and/or strength of interactions among components that cross the module boundaries. As we get to more components, more and more interactions will be required to cross module boundaries. The extreme result would be M=N.

The objective function could be evaluated for any number of potential designs that were manually or automatically created. This essentially provides a real-time feedback loop about the potential quality of a design. The range of the function is immediately bound by the two extremes. Your job as an architect and designer is to minimize this function.

Scoring lein-topology

For us to apply this scoring methodology meaningfully, we'll make a couple of simplifying assumptions:

  • clojure.core functions aren't moving to a different namespace.
  • tests shouldn't factor in to how the system is structured.

With these, we can apply the filtering from above a bit more strictly to get an even smaller subgraph of the function call network:


In [48]:
v_to_keep = g.vs(lambda v: 'topology' in v['label'] and not 'test' in v['label'])
tg = g.subgraph(v_to_keep)

# Recompute degree after building the subgraph:
tg.vs['indegree'] = tg.degree(mode="in")
tg.vs['outdegree'] = tg.degree(mode="out")

tg.summary()


Out[48]:
'IGRAPH DNW- 19 18 -- \n+ attr: group (v), indegree (v), label (v), name (v), namespace (v), outdegree (v), weight (e)'

The baseline modularity score of lein-topology's core function dependency graph is:


In [49]:
from sand.modularity import objective

objective(tg, tg.vs['group'])


Out[49]:
121

Where is this on the range of possibilities?

Suppose all functions were in the same namespace. We'll simulate this by setting the group membership vector to all 1's:


In [50]:
objective(tg, [1 for _ in range(len(tg.vs))])


Out[50]:
361

This is the degenerate case of M=1, so the objective function simply returns the square of the number of vertices:


In [51]:
len(tg.vs) * len(tg.vs)


Out[51]:
361

The other extreme occurs when we have the extreme of M=N, or all functions in their own namespace. We can simulate this by providing a unique group membership id for each vertex:


In [52]:
objective(tg, range(len(tg.vs)))


Out[52]:
199

Finally, we can compare our actual modularity score to a computational result. We can use Girvan-Newman edge-betweenness community detection to generate a modular design based on the network structure alone:


In [53]:
eb_membership = sand.edge_betweenness(tg, directed=True)

In [54]:
len(set(eb_membership))


Out[54]:
4

In [55]:
len(set(tg.vs['group']))


Out[55]:
7

So the edge betweenness algorithm comes up with fewer communities, i.e. namespace in this context. Let's see how the modularity score compares:


In [56]:
objective(tg, eb_membership)


Out[56]:
133

If this score is lower than our actual baseline, than the computational community structure may represent an improvement over the current structure. Which namespaces have changed groups? We may wish to refactor the code to reflect this structure.

If the edge betweenness modularity score is higher than our baseline, this fact acts as a quantitative defense of our design.

The novelty here is receiving an algorithmic recommendation about how to improve the organization of the code.

In this case, our current score of 121 is less than the algorithmic optimum of 133, so we can conclude we have a structure with acceptable coupling and cohesion.